
fix(ci): eliminate image-tag race between concurrent workflows#1413

Open
mesutoezdil wants to merge 1 commit into
NVIDIA:mainfrom
mesutoezdil:fix/1343-image-tag-race
@mesutoezdil
Contributor

Summary

  • Always append arch suffix to IMAGE_TAG in docker-build.yml so each workflow writes to its own registry slot
  • Run the merge step unconditionally when pushing so the bare :SHA tag is always a deterministic manifest list
  • Update branch-kubernetes-e2e.yml to pull the amd64-suffixed tag directly

Related Issue

Fixes #1343

Changes

Three concurrent workflows (Branch Kubernetes E2E, Branch E2E Checks, GPU Test) were all writing to the same bare :SHA tag. The merge step was gated on platform_count != 1, so single-arch builds collapsed onto the bare tag and raced with each other. kind load then failed when it found a manifest list but only had one arch locally.

The fix removes the conditional collapse so every build writes to :SHA-<arch>, and the merge step always assembles the bare :SHA tag as a proper manifest list.
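To illustrate the race in miniature, here is a toy model (not the actual workflow code) in which the registry is a temporary directory and each tag is a file recording its last writer. The `abc123` SHA and the `push` helper are placeholders, and nothing here talks to GHCR.

```shell
#!/bin/sh
# Toy registry: one file per tag, content = name of the last writer.
# Hypothetical model only; it does not contact any real registry.
registry="$(mktemp -d)"

push() {  # push <tag> <writer>
  printf '%s\n' "$2" > "$registry/$1"
}

sha=abc123

# Before: single-arch builds collapse onto the bare tag and overwrite each other.
push "$sha" "branch-kubernetes-e2e (amd64)"
push "$sha" "branch-e2e (arm64)"
bare_owner="$(cat "$registry/$sha")"        # whichever push ran last

# After: each build writes to its own arch-suffixed slot, so nothing is overwritten.
push "$sha-amd64" "branch-kubernetes-e2e (amd64)"
push "$sha-arm64" "branch-e2e (arm64)"
amd64_owner="$(cat "$registry/$sha-amd64")"
arm64_owner="$(cat "$registry/$sha-arm64")"

rm -rf "$registry"
```

In the "before" half the content of the bare tag depends purely on ordering. In the "after" half each slot has exactly one writer.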

Testing

  • CI-only change, validated by reading the workflow logic
  • No code changes, no Rust

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

@copy-pr-bot

copy-pr-bot Bot commented May 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@TaylorMutch
Collaborator

Thanks for taking this on. I do not think this closes #1343 yet.

Blocking issues:

  1. Kubernetes E2E now preloads ghcr.io/nvidia/openshell/<component>:${{ github.sha }}-amd64, but it still runs tests with IMAGE_TAG: ${{ github.sha }}. e2e/with-kube-gateway.sh passes IMAGE_TAG to Helm as image.tag and supervisor.image.tag, so the chart asks for bare :SHA while kind only has :SHA-amd64 loaded. Since kind nodes do not have GHCR credentials, this can still fail or pull the wrong bare tag.

  2. The bare :SHA race remains. After this change, single-arch Branch Kubernetes E2E and Branch E2E both run the merge job and publish the bare tag, while GPU Test also publishes the bare tag. That is still last-writer-wins across workflows; it just moves the bare write from the build job to the merge job.

  3. The PR description says single-arch merge produces a manifest list, but the workflow uses docker buildx imagetools create --prefer-index=false. For a single source, buildx documents this as a carbon-copy path, so it likely publishes a single-platform image manifest, not a one-entry index/list.
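Point 3 can be checked empirically rather than inferred. A minimal sketch, assuming the raw manifest is piped in from `docker buildx imagetools inspect --raw`; the `classify_manifest` helper is hypothetical and only looks at well-known media type strings:

```shell
#!/bin/sh
# classify_manifest: read a raw manifest (JSON) on stdin and print
# "index" for a manifest list / OCI index, "image" for a single-platform
# image manifest, or "unknown" otherwise.
# The index check comes first because an index also embeds image-manifest
# media types in its "manifests" entries.
classify_manifest() {
  raw="$(cat)"
  case "$raw" in
    *application/vnd.oci.image.index.v1+json*|*application/vnd.docker.distribution.manifest.list.v2+json*)
      echo index ;;
    *application/vnd.oci.image.manifest.v1+json*|*application/vnd.docker.distribution.manifest.v2+json*)
      echo image ;;
    *)
      echo unknown ;;
  esac
}

# Example use against a real tag (not run here):
#   docker buildx imagetools inspect --raw \
#     "ghcr.io/nvidia/openshell/<component>:<sha>" | classify_manifest
```

If the single-source merge really is a carbon copy, this would be expected to print `image` for the bare tag, which is worth confirming on an actual run.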

What I would change:

  • Do not let single-arch branch workflows publish the shared bare :SHA tag, or scope their bare tag by workflow.
  • Update branch-kubernetes-e2e to pass IMAGE_TAG: ${{ github.sha }}-amd64 if it preloads the amd64 tag, or retag the preloaded images to the bare SHA before kind load and make sure no other workflow can overwrite it.
  • Keep the release/dev multi-arch path publishing the bare SHA manifest, since release-tag.yml and release-dev.yml consume that tag and I do not see a release pipeline regression there.
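The retag option in the second bullet could look roughly like the following. This is a dry-run sketch that only prints the commands; `retag_for_kind` and the component names in the usage line are hypothetical, and the real workflow may structure this differently.

```shell
#!/bin/sh
# retag_for_kind <sha> <arch> <component>...
# Print (do not run) the commands that would retag each preloaded
# per-arch image to the bare SHA and then load it into the kind cluster,
# so that the tag Helm requests matches what kind has locally.
retag_for_kind() {
  sha="$1"; arch="$2"; shift 2
  repo="ghcr.io/nvidia/openshell"
  for component in "$@"; do
    echo "docker tag $repo/$component:$sha-$arch $repo/$component:$sha"
    echo "kind load docker-image $repo/$component:$sha"
  done
}

# Usage (component names are placeholders):
#   retag_for_kind "$GITHUB_SHA" amd64 gateway supervisor | sh -x
```

Because the retag happens only in the local Docker daemon, it does not write the shared bare tag in GHCR, so it avoids the cross-workflow overwrite as long as nothing in the job pushes that tag.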

Net: the release pipeline looks okay, but this PR should be revised before merge because the CI race is not actually removed and K8s E2E has a tag mismatch.

@TaylorMutch TaylorMutch self-assigned this May 15, 2026
- Add publish-manifest input to docker-build.yml (default true); single-arch
  branch callers set it false so the merge job is skipped and the shared
  bare :SHA tag in GHCR is never written by branch workflows
- branch-kubernetes-e2e: retag :SHA-amd64 to :SHA before kind load so
  Helm's image.tag matches what is loaded in kind containerd
- branch-e2e: pass image-tag as :SHA-arm64 to e2e-test so the arch-specific
  GHCR tag is used directly without depending on the bare tag
- bare :SHA in GHCR is now written only by test-gpu.yml (multi-arch build),
  eliminating the last-writer-wins race across concurrent workflows
@mesutoezdil mesutoezdil force-pushed the fix/1343-image-tag-race branch from a53f726 to e9968cb Compare May 16, 2026 09:50
@mesutoezdil
Contributor Author

mesutoezdil commented May 16, 2026

Thanks! All three blocking issues are addressed: the kind tag mismatch, the bare :SHA race, and --prefer-index=false on a single source.
The release pipeline is untouched. I hope this can be merged now.


Development

Successfully merging this pull request may close these issues.

bug(ci): image-tag race between concurrent workflows breaks Branch Kubernetes E2E

2 participants